The Zillow Prize is a Kaggle competition that aims to inspire data scientists around the world to improve the accuracy of the Zillow "Zestimate" statistical and machine learning models.
My goal is to compete for the Zillow prize and write up my results.
The data were obtained from the Kaggle website and consist of the following files:
properties_2016.csv.zip
properties_2017.csv.zip
sample_submission.csv
train_2016_v2.csv.zip
train_2017.csv.zip
zillow_data_dictionary.xlsx
The zillow_data_dictionary.xlsx file is a codebook that explains the data.
These data will be made available on figshare to provide an additional source in case the Kaggle site data become unavailable. Data analysis was done in the Jupyter Notebook (Pérez and Granger 2007) environment using the Python language (Pérez, Granger, and Hunter 2011) and a number of software packages:
NumPy (van der Walt, Colbert, and Varoquaux 2011)
pandas (McKinney 2010)
scikit-learn (Pedregosa et al. 2011)
The following packages were used to visualize the data:
Matplotlib (Hunter 2007)
Seaborn (Waskom et al. 2014)
Reproducibility is extremely important in scientific research, yet many examples of problematic studies exist in the literature (Couzin-Frankel 2010).
The names and versions of each package used herein are listed in the accompanying env.yml file in the config folder.
The computational environment used to analyze the data can be recreated using this env.yml file and the conda package and environment manager available as part of the Anaconda distribution of Python.
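As a sketch of this step, assuming the repository layout described here (an env.yml file inside the config folder), the environment could be recreated with conda as follows; the environment name "zillow" is hypothetical and is actually set inside env.yml:

```shell
# Recreate the analysis environment from the exported specification
conda env create -f config/env.yml

# Activate it (replace "zillow" with the name defined in env.yml)
conda activate zillow
```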
Additionally, details on how to set up a Docker image capable of running the analysis are included in the README.md file in the config folder.
The code, in the form of a Jupyter notebook (01_zillow_MWS.ipynb) or Python script (01_zillow_MWS.py), can also be run on the Kaggle website (this requires logging in with a username and password).
More information on the details of how this project was created and the computational environment was configured can be found in the accompanying README.md file.
This Python 3 environment comes with many helpful analytics libraries installed. It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python (a modified version of this Docker image will be made available as part of my project to ensure reproducibility). For example, here are several helpful packages to load.
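A minimal sketch of such a setup cell, importing two of the core packages cited above and listing the Kaggle input directory when it exists (the guard lets the same cell run outside Kaggle):

```python
import os

import numpy as np   # numerical computing (van der Walt et al. 2011)
import pandas as pd  # tabular data handling (McKinney 2010)

# Input data files live in ../input/ on Kaggle; guard for other environments
input_dir = "../input/"
if os.path.isdir(input_dir):
    print(os.listdir(input_dir))
else:
    print("No ../input/ directory found; running outside Kaggle")
```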
Input data files are available in the "../input/" directory.
Any results I write to the current directory are saved as output.
In Progress
Distribution of Target Variable:
Log-errors are approximately normally distributed around a mean of 0, with a slight positive skew. There is also a considerable number of outliers; I will explore whether removing them improves model performance.
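One common way to explore outlier removal is percentile-based trimming. The sketch below uses a synthetic stand-in for the logerror column (the percentile band of 1st to 99th is an arbitrary choice for illustration, not the project's actual threshold):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic stand-in for logerror: roughly normal with some heavy-tailed outliers
logerror = pd.Series(np.concatenate([rng.normal(0, 0.05, 1000),
                                     rng.normal(0, 1.0, 20)]))

# Keep only values inside the 1st-99th percentile band
lo, hi = logerror.quantile([0.01, 0.99])
trimmed = logerror[(logerror >= lo) & (logerror <= hi)]
print(len(logerror), len(trimmed))
```

Model performance with and without the trimmed rows can then be compared on a validation split.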
Proportion of Missing Values in Each Column:
There are several columns which have a very high proportion of missing values. It may be worth analysing these more closely.
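The per-column missing proportion can be computed in one line with pandas; here is a sketch on a toy frame whose column names merely imitate the property table:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the properties table
df = pd.DataFrame({
    "taxamount": [1000.0, np.nan, 1500.0, 2000.0],
    "poolcnt":   [np.nan, np.nan, np.nan, 1.0],
})

# Proportion of missing values per column, highest first
missing = df.isnull().mean().sort_values(ascending=False)
print(missing)
```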
For submission, we are required to predict values for October, November, and December. The differing distributions of the target variable over these months indicate that it may be useful to create an additional 'transaction_month' feature, as shown above. Let's take a closer look at the distribution across only October, November, and December.
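Deriving such a 'transaction_month' feature is a one-liner with pandas; this sketch assumes a transaction-date column in the training table (dates below are illustrative):

```python
import pandas as pd

# Toy training rows; the column name "transactiondate" is assumed
train = pd.DataFrame({
    "transactiondate": ["2016-10-03", "2016-11-15", "2016-12-28"],
})

# Engineer the month feature from the transaction date
train["transaction_month"] = pd.to_datetime(train["transactiondate"]).dt.month
print(train)
```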
Proportion of Transactions in Each Month
Feature Importance
Here we see that the greatest importance in predicting the log-error comes from features involving taxes and the geographical location of the property. Notably, the 'transaction_month' feature engineered earlier was the 12th most important feature.
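Importance rankings like this are typically read off a fitted tree-based model. As an illustration only (a random forest on synthetic data; the section does not state which model produced the ranking above):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Target depends mostly on the first column, mimicking one dominant feature
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)

model = RandomForestRegressor(n_estimators=30, random_state=0).fit(X, y)
importances = model.feature_importances_  # sums to 1 across features
print(importances)
```

Sorting features by these values gives a ranking like the one discussed above.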
In Progress
Couzin-Frankel, J. 2010. “Cancer Research. As Questions Grow, Duke Halts Trials, Launches Investigation.” Science 329 (5992): 614–15.
Hunter, J. D. 2007. “Matplotlib: A 2D Graphics Environment.” Computing In Science & Engineering 9 (3): 90–95.
McKinney, W. 2010. “Data Structures for Statistical Computing in Python.” In Proceedings of the 9th Python in Science Conference, edited by S. J. van der Walt and K. J. Millman. Austin, Texas.
Pedregosa, Fabian, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, et al. 2011. “Scikit-Learn: Machine Learning in Python.” Journal of Machine Learning Research 12 (Oct): 2825–30.
Pérez, F., and B. E. Granger. 2007. “IPython: A System for Interactive Scientific Computing.” Computing in Science & Engineering 9 (3): 21–29.
Pérez, F., B. E. Granger, and J. D. Hunter. 2011. “Python: An Ecosystem for Scientific Computing.” Computing in Science & Engineering 13 (2): 13–21.
van der Walt, S., S. C. Colbert, and G. Varoquaux. 2011. “The NumPy Array: A Structure for Efficient Numerical Computation.” Computing in Science & Engineering 13 (2): 22–30.
Waskom, M, O Botvinnik, P Hobson, J Warmenhoven, JB Cole, Y Halchenko, J Vanderplas, et al. 2014. Seaborn: Statistical Data Visualization. Stanford, California.